Multimedia Event Detection and Recounting (MED and MER)
Abstract
We report on our system used in the TRECVID 2012 Multimedia Event Detection (MED) and Multimedia Event Recounting (MER) tasks. For MED, the system consists of three main steps: extracting features, training detectors, and fusion. In the feature extraction part, we extract many low-level, high-level, and text features. Those features are then represented in three different ways: spatial bag-of-words with standard tiling, spatial bag-of-words with feature- and event-specific tiling, and the Gaussian Mixture Model Super Vector. For detector training and fusion, two classifiers and three fusion methods are employed. The results from both the official evaluation and our internal evaluations show the good performance of our system. Our MER system utilizes a subset of the features and detection results from the MED system, from which the recounting is generated.

1. MED System

1.1 Features

In order to encompass all aspects of a video, we extracted a wide variety of low-level and high-level features. Table 1 summarizes the features used in our system. Most of them are widely used in the community, for example SIFT, STIP and MFCC; we extracted them using the standard code available from the authors with default parameters.

Table 1: Features used for the MED'12 system

Low-level visual features:
1. SIFT (Sande, Gevers, & Snoek, 2010)
2. Color SIFT (CSIFT) (Sande, Gevers, & Snoek, 2010)
3. Motion SIFT (MoSIFT) (Chen & Hauptmann, 2009)
4. Transformed Color Histogram (TCH) (Sande, Gevers, & Snoek, 2010)
5. STIP (Wang, Ullah, Klaser, Laptev, & Schmid, 2009)
6. Dense Trajectory (Wang, Klaser, Schmid, & Liu, 2011)

Low-level audio features:
1. MFCC
2. Acoustic Unit Descriptors (AUDs) (Chaudhuri, Harvilla, & Raj, 2011)

High-level visual features:
1. Semantic Indexing Concepts (SIN) (Over, et al., 2012)
2. Object Bank (Li, Su, Xing, & Fei-Fei, 2010)

High-level audio features:
1. Acoustic Scene Analysis

Text features (visual):
1. Optical Character Recognition (OCR)

Text features (audio):
1. Automatic Speech Recognition (ASR)

Besides those common features, we have two home-grown features: Motion SIFT (MoSIFT) and Acoustic Unit Descriptors (AUDs). We introduce them in the following subsections.

1.1.1 Motion SIFT (MoSIFT) Feature

The goal of the MoSIFT feature is to combine features from the spatial domain and the temporal domain. Local spatio-temporal features around interest points provide compact and descriptive representations for video analysis and motion recognition. Current approaches tend to extend spatial descriptors by adding a temporal component to the appearance descriptor, which only implicitly captures motion information. MoSIFT detects interest points and encodes not only their local appearance but also explicitly models local motion: the idea is to detect distinctive local features through both local appearance and motion. Figure 1 illustrates the MoSIFT algorithm.

Figure 1: System flow chart of the MoSIFT algorithm

The algorithm takes a pair of video frames and finds spatio-temporal interest points at multiple scales. Two major computations are applied: SIFT point detection, and optical flow computation at the scale of each SIFT point. For the descriptor, MoSIFT adapts the grid-aggregation idea of SIFT to describe motion. Optical flow captures the magnitude and direction of movement and thus has the same properties as appearance gradients, so the same aggregation can be applied to optical flow in the neighborhood of interest points to increase robustness to occlusion and deformation. The two aggregated histograms (appearance and optical flow) are combined into the MoSIFT descriptor, which has 256 dimensions.

1.1.2 Acoustic Unit Descriptors (AUDs)

We have developed an unsupervised lexicon-learning algorithm that automatically learns units of sound. Each unit spans a set of audio frames, thereby taking local acoustic context into account.
Using a maximum-likelihood estimation process, we can learn a set of such acoustic units from audio data without supervision. Each of these units can be thought of as a low-level fundamental unit of sound, and each audio frame is generated by these units. We refer to these units as Acoustic Unit Descriptors (AUDs), and we expect that the distribution of these units carries information about the semantic content of the audio stream. Each AUD is represented by a 5-state Hidden Markov Model (HMM) with a 4-Gaussian mixture output density function.

Ideally, with a perfect learning process, we would like to learn semantically interpretable lower-level units, such as a clap, a thud, a bang, etc. Naturally, it is hard to enforce semantic interpretability on the audio learning process at that level of detail. Further, because the space of all possible sounds is so large and we can only learn a finite set of units, many different sounds will be mapped to a single unit at learning time.

1.2 Feature Representations

In the previous section, we briefly described the features used in the system. In this section, we describe the representations applied to the raw features extracted in Section 1.1. Three representations were used: a K-means based spatial bag-of-words model with standard tiling (Lazebnik, Schmid, & Ponce, 2006), a K-means based spatial bag-of-words model with feature- and event-specific tiling (Viitaniemi & Laaksonen, 2009), and the Gaussian Mixture Model Super Vector (Campbell & Sturim, 2006). Since the spatial bag-of-words model with standard tiling and the Gaussian Mixture Model Super Vector are standard technology, we focus on the K-means based spatial bag-of-words model with feature- and event-specific tiling; for simplicity, we refer to it as tiling. The spatial bag-of-words model is a widely used representation of low-level image/video features.
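The representation can be sketched as follows. This is a minimal illustration, not our actual implementation: the function names, the assumption that descriptors have already been quantized to K-means codeword indices, and the example tilings are all illustrative.

```python
import numpy as np

def spatial_bow(points, words, vocab_size, width, height, tiling=(2, 2)):
    """Spatial bag-of-words: each quantized descriptor (words[i], a
    K-means codeword index) located at points[i] is counted in the
    histogram of the tile that contains it; the per-tile histograms
    are then concatenated into one vector."""
    rows, cols = tiling
    hists = np.zeros((rows, cols, vocab_size))
    for (x, y), w in zip(points, words):
        r = min(int(y / height * rows), rows - 1)
        c = min(int(x / width * cols), cols - 1)
        hists[r, c, w] += 1
    return hists.reshape(-1)

# A feature- or event-specific representation simply uses the tiling
# (or set of tilings) that performs best for that feature/event:
def multi_tiling_bow(points, words, vocab_size, width, height, tilings):
    return np.concatenate([spatial_bow(points, words, vocab_size,
                                       width, height, t) for t in tilings])
```

With a 2x2 tiling and a vocabulary of size K, the resulting vector has 4K dimensions; concatenating several tilings concatenates their vectors.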
The central idea of the spatial bag-of-words model is to divide the image into small tiles and compute a bag-of-words histogram for each tile. Figure 2 shows a few tiling examples.

Figure 2: Examples of tiling

In general, standard spatial bag-of-words uses the 1x1, 2x2 and 4x4 tilings. However, this choice is ad hoc, and preliminary work has shown that other tilings can produce better performance (Viitaniemi & Laaksonen, 2009). In our system, we systematically tested 80 different tilings to select the best one for each feature and each event. Table 2 shows the performance of feature-specific tiling vs. the standard tiling. The scores are computed from our internal experiments and are averaged over the 20 MED12 pre-specified events. The PMiss @ TER=12.5 metric is an official evaluation metric specified in the MED 2012 Evaluation Plan; a smaller PMiss score signifies better performance. The table clearly shows that, for all five features, feature-specific tiling consistently performs at least 1% better than standard tiling.

Table 2: The performance of feature-specific tiling and standard tiling (PMiss @ TER=12.5)

Feature                   SIFT     CSIFT    TCH      STIP     MoSIFT
Feature-specific tiling   0.4209   0.4496   0.4914   0.5178   0.4330
Standard tiling           0.4325   0.4618   0.5052   0.5234   0.4456

Figure 3 shows an example of the performance of event-specific tiling vs. standard tiling on Event 25 (marriage proposal), a difficult event identified in our experiments. Event-specific tiling noticeably improves performance over standard tiling.

Figure 3: The comparison of event-specific tiling and standard tiling on Event 25

1.3 Training and Fusion

We used the standard MED'12 training dataset for our internal evaluation and for training the models of our submission.
For our internal evaluation, the MED'12 training dataset was further divided into a training set and a testing set by randomly selecting half of the positive examples for the training set and the other half for the testing set. The negative examples consisted only of NULL videos, which do not have label information. The two classifiers used in the system were the kernel SVM and kernelized ridge regression; for simplicity, we refer to the latter as kernel regression. For the K-means based feature representations we used the Chi-squared kernel, and for the GMM based representation the RBF kernel. The parameters of the models were tuned by 5-fold cross-validation, with PMiss @ TER=12.5 as the evaluation metric.

For combining features from multiple modalities and the outputs of different classifiers, we used fusion and ensemble methods. More specifically, for a given classifier we used three methods to fuse different features: early fusion, late fusion and double fusion (Lan, Bao, Yu, Liu, & Hauptmann, 2012). In early fusion, the kernel matrices from different features are normalized first and then combined. In late fusion, the prediction scores from the models trained on different features are combined. Double fusion combines early fusion and late fusion. Finally, the results from the different classifiers were ensembled together. Figure 4 shows the diagram of our system.

Figure 4: The diagram of the system

1.4 Submission

In the following section we describe in detail the runs we submitted to NIST. Table 3 shows the official performance of each submission.
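The kernel and fusion steps of Section 1.3 can be sketched as below. This is a simplified sketch under assumptions of our own: uniform combination weights, the function names, and the exponential form of the Chi-squared kernel are illustrative, not the exact system configuration.

```python
import numpy as np

def chi2_kernel(X, Y, gamma=1.0):
    """Exponential Chi-squared kernel between histogram rows of X and Y."""
    K = np.zeros((len(X), len(Y)))
    for i, x in enumerate(X):
        K[i] = ((x - Y) ** 2 / (x + Y + 1e-12)).sum(axis=1)
    return np.exp(-gamma * K)

def early_fusion(kernels):
    """Early fusion: combine the normalized per-feature kernel matrices."""
    return sum(kernels) / len(kernels)

def late_fusion(score_lists):
    """Late fusion: combine per-feature prediction scores."""
    return sum(np.asarray(s) for s in score_lists) / len(score_lists)

def double_fusion(early_scores, late_scores, alpha=0.5):
    """Double fusion: blend the early-fusion and late-fusion outputs."""
    return alpha * np.asarray(early_scores) + (1 - alpha) * np.asarray(late_scores)
```

A classifier trained on `early_fusion(...)` gives one score per clip; `double_fusion` then blends those scores with the `late_fusion` scores of the per-feature classifiers.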
1.4.1 Pre-Specified Submissions

1.4.1.1 Submission 1: CMU_MED12_MED12TEST_PS_MEDFull_EKFull_AutoEAG_p_ensembleKRSVM_1

Using the features described in the previous section, we did the following to generate this run:
1. For each feature, train an SVM classifier and a kernel regression model.
2. Late-fuse all the results from the SVM classifiers and the kernel regression models, respectively.
3. Early-fuse all features except ASR.
4. Train an SVM classifier and a kernel regression model on the result of step 3, respectively.
5. Double fusion of the SVM classifiers from steps 2 and 4.
6. Double fusion of the kernel regression models from steps 2 and 4.
7. Ensemble of steps 5 and 6.

1.4.1.2 Submission 2: CMU_MED12_MED12TEST_PS_MEDFull_EK10Ex_AutoEAG_c_KRLF_1

1. For each feature, train a kernel regression model.
2. Late-fuse all the results from step 1.

1.4.1.3 Submission 3: CMU_MED12_MED12TEST_PS_MEDFull_EKFull_AutoEAG_c_SVMLF_1

1. For each feature, train an SVM classifier.
2. Late-fuse all the results from step 1.

1.4.1.4 Submission 4: CMU_MED12_MED12TEST_PS_MEDFull_EKFull_AutoEAG_c_BOB_1

1. Form all prediction results from steps 1-7 of Submission 1 into a pool.
2. For each event, find the candidate in the pool with the best performance.
3. Combine the per-event candidates to form the submission.

1.4.2 Ad-Hoc Submissions

1.4.2.1 Submission 5: CMU_MED12_MED12TEST_AH_MEDFull_EKFull_AutoEAG_pSVM_1

The following features were used: SIFT, CSIFT, Transformed Color Histogram (TCH), Motion SIFT (MoSIFT), STIP, Dense Trajectory (DT), MFCC, SIN and Object Bank. Unlike our pre-specified EKFull submission, we did not use the GMM Super Vector and tiling representations. To get the detection results, the following steps, which largely followed the pre-specified submission, were performed:
1. For each feature, train an SVM classifier.
2. Late-fuse the scores of each feature obtained from step 1.
3.
Early fusion of the distance matrices of all the visual and acoustic features, then use the obtained distance matrix to compute the kernel matrix.
4. Train an SVM classifier on the kernel obtained in step 3.
5. Double fusion of the results from steps 2 and 4.

1.4.2.2 Submission 6: CMU_MED12_MED12TEST_AH_MEDFull_EK10Ex_AutoEAG_cKR_1

The same features as in Submission 5 were used. In our previous experiments, SVM tended to overfit the limited positive exemplars, so for EK10Ex we used kernel regression with the Chi-squared kernel as the classifier. As we only have 10 positive exemplars for training, it is tricky to tune the regularization parameter of kernelized ridge regression by cross-validation. We have observed in our experiments that fixing the parameter to 1 usually yields good, though not necessarily the best, performance; we therefore set the regularization parameter to 1 for all events. To get the detection results, the following steps, which largely followed the pre-specified submission, were performed:
1. For each feature, train a kernel regression model.
2. Late-fuse the prediction scores of each feature obtained from step 1.
3. Early fusion of the distance matrices of all the visual and acoustic features, then use the obtained distance matrix to compute the kernel.
4. Train a ridge regression classifier on the kernel obtained in step 3.
5. Double fusion of the scores obtained from steps 2 and 4.

Table 3: The official performance of the 6 submissions

Task           Train Type  EAG      SYSID              NDC     PFa     PMiss
Pre-Specified  EKFull      AutoEAG  p-ensembleKRSVM_1  0.637   0.0341  0.2113
Pre-Specified  EKFull      AutoEAG  c-SVMLF_1          0.6584  0.0341  0.2325
Pre-Specified  EKFull      AutoEAG  c-BOB_1            0.6427  0.0341  0.2168
Pre-Specified  EK10Ex      AutoEAG  c-KRLF_1           0.8588  0.0345  0.4286
Ad-Hoc         EKFull      AutoEAG  p-SVM_1            0.6494  0.0354  0.208
Ad-Hoc         EK10Ex      AutoEAG  c-KR_1             1.1078  0.0568  0.3982
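The kernelized ridge regression used for the EK10Ex runs has a simple closed form, sketched below with the regularization parameter fixed at 1 as described above. The variable and function names are our illustrative assumptions, not the actual training code.

```python
import numpy as np

def kernel_ridge_fit(K_train, y, lam=1.0):
    """Kernel ridge regression, closed form: alpha = (K + lam*I)^(-1) y.
    lam=1.0 mirrors the fixed regularization used with 10 exemplars."""
    n = K_train.shape[0]
    return np.linalg.solve(K_train + lam * np.eye(n), y)

def kernel_ridge_predict(K_test_train, alpha):
    """Detection scores: kernel values of test clips against training
    clips, multiplied by the learned coefficients."""
    return K_test_train @ alpha
```

With so few positives, avoiding cross-validated tuning of `lam` trades a little accuracy for stability, which matches the observation reported for the EK10Ex setting.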
2. MER System

2.1 Features

Our MER submission covered the following aspects:

• Relationships
  o Visual features that are relevant to the event
  o Audio features that are relevant to the event
  o Co-occurrence of the visual concepts (SIN'11)
• Observations
  o Event-relevant visual concepts
  o Video-distinctive visual concepts
  o ASR transcripts
  o Event-specific Object Bank results
  o Audio concepts (Noisemes)

2.2 Visual and Audio Concepts

We use the histogram of each visual semantic class aggregated over the whole video clip. To use the visual concepts, we first generated a bipartite graph matching between Object Bank classes and SIN'11 concepts for the MED12 dataset. The process flow is shown in Figure 5.

Figure 5: Flow chart of visual and audio concept processing

The Noiseme semantic audio concepts similarly indicate "non_linguistic_audio" information in the video (e.g. "speech", "music", "noise"). We use the histogram of each audio concept in the video to indicate, for example, that the video mainly contains the sound of music, singing or noise. We again use bipartite graph matching to map the Noisemes to the events. All audio concepts are ranked by their percentage in the video.

2.3 ASR Transcripts

Automatic speech recognition transcripts indicate "linguistic_audio" information in the video (e.g. "okay", "hello", "she didn't"). We use TF-IDF, weighted by the word-level ASR confidence, to calculate the relevance of each recognized word to the event kit, and then rank the ASR transcript words according to their relevance to the event.

2.4 An Example of Our Recounting Submission

All Multimedia Event Recounting (MER) participants were required to produce a recounting for 30 selected video clips, where each clip is known to contain a specific MER event. Five events were chosen from the MED pre-specified events list, with six video clips per event.
The systems' recounting summarizations were evaluated by a panel of judges. An example of our recounting submission is shown in Figure 6.

Figure 6: An example of our MER submission

2.5 Performance

Table 4 shows the performance of our MER system compared to the average performance of the other submitted systems. We achieve significantly better performance on both MER tasks, which shows the effectiveness of our MER system.

Table 4: Official performance of MER

                              MER-to-Event  MER-to-Clip  Combined (0.4*E + 0.6*C)
Average of submitted systems  68.75%        43.72%       0.54
CMU_ELamp-MER-System          85.56%        66.30%       0.74
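The combined column in Table 4 follows the stated weighting; as a one-line sketch (treating the percentage accuracies as fractions):

```python
def mer_combined(event_score, clip_score):
    """Combined MER score = 0.4 * MER-to-Event + 0.6 * MER-to-Clip."""
    return 0.4 * event_score + 0.6 * clip_score

# e.g. our system: 0.4 * 0.8556 + 0.6 * 0.6630 ~ 0.74
```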